ftp.cs.arizona.edu

Internet Message Format | 2000-09-20 | 5KB

Return-Path: <icon-group-sender>
Received: from kingfisher.CS.Arizona.EDU (kingfisher.CS.Arizona.EDU [192.12.69.239])
    by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) with SMTP id QAA04970
    for <icon-group-addresses@baskerville.CS.Arizona.EDU>; Thu, 10 Sep 1998 16:56:48 -0700 (MST)
Received: by kingfisher.CS.Arizona.EDU (5.65v4.0/1.1.8.2/08Nov94-0446PM)
    id AA31454; Thu, 10 Sep 1998 16:56:21 -0700
Date: Fri, 11 Sep 1998 08:56:14 +1200 (NZST)
From: "Richard A. O'Keefe" <ok@atlas.otago.ac.nz>
Message-Id: <199809102056.IAA16557@atlas.otago.ac.nz>
To: gep2@computek.net, icon-group@optima.CS.Arizona.EDU
Subject: Re: Unicode support or support for non-Ascii based character manipulation?
Errors-To: icon-group-errors@optima.CS.Arizona.EDU
Status: RO

Gordon Peterson (http://www.computek.net/public/gep2/) wrote:

> Okay, I don't dispute that this move is happening but personally I still don't very much like it. The fact is that (at least here in the Western Hemisphere, where probably most of the world's computers are used) an eight-bit byte is already quite sufficient for most purposes, and doubling it comes at a cost in complexity and storage (RAM, disk, tape, whatever) which is simply very, very hard to justify on any genuine economic basis.

This is a fictitious problem. UNIX systems at least support UTF-8, which is an encoding described in ISO 10646 and the Unicode book, with the property that ASCII characters *still* occupy exactly one byte each. When I use getwc() on this system, it decodes UTF-8 files and gives me ISO 10646 wide characters internally.

> If other countries have more difficult (or huge) character sets, that is (while a fact of life) simply an inherent disadvantage of their culture (and note that I'm not intending that as a slam or value judgement, it just IS the way it is), and I don't see a terribly convincing argument why the other countries (without that disadvantage) ought to pay the price too, just in order to artificially level the playing field.

Many people _within_ Western Europe do not have the luxury of dealing with only a single language. I cannot write my father's name in ASCII, nor my sister-in-law's. Both of them are (in my father's case, were) monoglot Anglophones born into monoglot Anglophone families in an English-speaking country. I _can_ write their names in ISO Latin-1, but I _can't_ write half of the place-names of this country! (The officially approved orthography for Maori puts a macron over long vowels, like the 'a' in Maori. There are no macrons in Latin-1.) Even if my text switched between Latin-1 family members, I _still_ wouldn't be able to write English, because the inverted comma and double inverted comma quotation marks are not available, let alone en dashes and em dashes. The *only* character set around in which this functionally-monoglot Anglophone can write *in English* about the people and places around him is ISO 10646; even Latin-1 just isn't good enough FOR ENGLISH!

Note that ISO C, ISO C++ (which finally exists), and the world's first standard object-oriented language, Ada95, all support wide characters (you need Amendment 1 to ISO C to get getwc() & co.), and that UNIX and Windows NT allow Unicode file names.

I also note that Icon (like SNOBOL before it) has been of particular interest to scholars in the humanities, who would, for example, like to put Hebrew _and_ Arabic in the same document with English. That is something you can't do in any ISO 8859 family member without code switching, which is much harder to deal with than Unicode.

There is the pretty obvious point that within Europe, they are going to *have* to use the new "Euro" sign. (Why have the Europeans named their new currency after an Australian mammal?) That's U+20AC, and if there's an 8-bit character set that has it, please tell us which.

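The repertoire claims above are easy to check mechanically. What follows is a minimal sketch, assuming Python 3 and its built-in `ascii`, `latin-1` (ISO 8859-1), and `utf-8` codecs. The sample table and the helper names `fits` and `utf8_length` are made up for this illustration and do not come from the message.

```python
# Hypothetical sample table: characters the message says are needed
# for everyday English-language text in New Zealand and Europe.
SAMPLES = {
    "a with macron (Maori long vowel)": "\u0101",
    "left single quotation mark": "\u2018",
    "left double quotation mark": "\u201c",
    "en dash": "\u2013",
    "em dash": "\u2014",
    "euro sign": "\u20ac",
}


def fits(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def utf8_length(text):
    """Return the number of bytes text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(label,
              "| ascii:", fits(ch, "ascii"),
              "| latin-1:", fits(ch, "latin-1"),
              "| utf-8:", fits(ch, "utf-8"),
              "| utf-8 bytes:", utf8_length(ch))
    # Plain ASCII text costs one byte per character in UTF-8.
    print("Icon ->", utf8_length("Icon"), "bytes")
```

Under these assumptions none of the sampled characters fits in ASCII or Latin-1, all of them fit in UTF-8, and a pure ASCII string such as "Icon" still takes one byte per character.
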
> I can certainly understand and appreciate the problems that the huge character sets used in some eastern countries have played for them

Never mind eastern countries. What about an American businessman writing to an office in Germany about their operations in Russia? What about a theologian writing in English but quoting Hebrew and Greek frequently? What about an English professor writing a book in modern English about Old English? (We've lost several letters since then. Ash, eth, and thorn _are_ in Latin-1, but yogh and wynn are not, nor are they in any other 8-bit character set I know of.)

> Has anyone thought about this yet? What does string and pattern matching mean in, for example, Japanese?

The real problem is the equivalence one would expect between precomposed characters and base characters + floating diacriticals. That's _really_ proctalgic.

By the way, 16 bits isn't enough; there are proposals already far advanced in the pipeline for characters to go into Plane 1.
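
Both closing points can be shown in a few lines. This is only a sketch, assuming Python 3 and its standard `unicodedata` module. It uses the NFC normalization form from the present-day Unicode standard, which the message does not mention, and the helper names `same_text` and `utf16_units` are invented for the example.

```python
import unicodedata

PRECOMPOSED = "\u0101"    # LATIN SMALL LETTER A WITH MACRON, one code point
DECOMPOSED = "a\u0304"    # 'a' followed by COMBINING MACRON, two code points


def same_text(a, b):
    """Compare two strings after normalizing both to NFC.

    NFC is one possible choice of canonical form; a pattern matcher
    could equally normalize both sides to NFD.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def utf16_units(ch):
    """Return how many 16-bit code units a string needs in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2


if __name__ == "__main__":
    # A naive comparison sees two different strings.
    print("raw equality:       ", PRECOMPOSED == DECOMPOSED)
    # After normalization they compare equal.
    print("normalized equality:", same_text(PRECOMPOSED, DECOMPOSED))
    # A character in Plane 1 (U+10000 and up) does not fit in one 16-bit unit.
    print("U+20AC units: ", utf16_units("\u20ac"))
    print("U+10000 units:", utf16_units("\U00010000"))
```

The point of the sketch is that string comparison and pattern matching have to decide whether they work on raw code points or on normalized text, and that any fixed 16-bit representation needs an escape mechanism once characters beyond Plane 0 are assigned.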